Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing
Authors
Abstract
Recently, an algorithm-based approach using diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this approach, since fault tolerance is incorporated into the matrix operations, the matrix operations become resilient to any single processor failure or change with low overhead. In this paper, we present a technique called multiple checkpointing to enable the matrix operations to tolerate a certain set of multiple processor failures by adding the capacity for multiple checkpointing processors. The results on a network of workstations have shown that this technique improves not only the reliability of the computation but also the performance of checkpointing.
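The idea in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that each checkpointing processor stores an element-wise sum (checksum) of the data blocks held by one group of computing processors, so that one lost block per group can be rebuilt by subtraction. The grouping scheme, the function names (`checksum`, `recover`, `group_checkpoints`, `recover_all`), and the use of plain Python lists are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Optional

Block = List[float]


def checksum(blocks: List[Block]) -> Block:
    """Element-wise sum of equally sized data blocks (the checkpoint)."""
    return [sum(vals) for vals in zip(*blocks)]


def recover(blocks: List[Optional[Block]], checkpoint: Block) -> List[Block]:
    """Rebuild at most one missing block (None) in a group from its checkpoint."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) > 1:
        raise ValueError("one checkpoint can repair only one failure per group")
    repaired = list(blocks)
    if missing:
        survivors = [b for b in blocks if b is not None]
        # Sum of the surviving blocks; all zeros if the group has no survivors.
        partial = checksum(survivors) if survivors else [0.0] * len(checkpoint)
        # Lost block = checkpoint - sum of survivors.
        repaired[missing[0]] = [c - p for c, p in zip(checkpoint, partial)]
    return repaired


def group_checkpoints(blocks: List[Block], group_size: int) -> List[Block]:
    """Assumed scheme: one checkpoint (one checkpointing processor) per group."""
    return [checksum(blocks[i:i + group_size])
            for i in range(0, len(blocks), group_size)]


def recover_all(blocks: List[Optional[Block]], checkpoints: List[Block],
                group_size: int) -> List[Block]:
    """Repair every group independently, one failure per group at most."""
    out: List[Block] = []
    for g, cp in enumerate(checkpoints):
        out.extend(recover(blocks[g * group_size:(g + 1) * group_size], cp))
    return out


# Four computing processors, two groups, hence two checkpoints.
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
cps = group_checkpoints(data, 2)
# Processors 0 and 3 fail at the same time; they belong to different groups.
damaged = [None, [3.0, 4.0], [5.0, 6.0], None]
restored = recover_all(damaged, cps, 2)
```

Under this assumed scheme, a single checkpoint over all four processors could repair only one of the two failures, whereas two group checkpoints repair both, provided the failures fall in different groups. Each checkpoint also sums fewer blocks, which is one plausible reading of why checkpointing performance could improve as well.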
Similar resources
Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing
Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless chec...
Algorithm-Based Diskless Checkpointing for Fault Tolerant Matrix Operations
This paper is an exploration of diskless checkpointing for distributed scientific computations. With the widespread use of the "Network Of Workstation" (NOW) platform for distributed computing, long-running scientific computations need to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several algorithms for distributed scientific c...
STAR: A Fault-Tolerant System for Distributed Applications
This paper presents a fault-tolerant manager for distributed applications. This manager provides an efficient recovery of hosts’ failures on networks of workstations. An independent checkpointing is used to automatically recover application processes affected by host failures. Domino-effects are avoided by means of message logging and file versions management. STAR provides an efficient softwar...
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid Computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing that periodically ...
Fault-tolerant Distributed Applications In LiPS
Performing computations using networks of workstations is increasingly becoming an alternative to using a supercomputer. This approach is motivated by the vast quantities of unused idle-time available in workstation networks. Unlike computing on a tightly coupled parallel computer, where a fixed number of processor nodes is used within a computation, the number of useable nodes in a workstati...